Address residual P1+P2s from re-audit of PR #412 by igerber · Pull Request #422 · igerber/diff-diff

igerber · 2026-05-13T00:14:37Z

Summary

Audit follow-up to PR #412. The restored CI reviewer surfaced findings the degraded reviewer missed across all 5 prior rounds:

P1 (REGISTRY + code comment) - the claim that "R does not ship per-path `predict_het` on placebos either, so parity is preserved by deferral" contradicts what R's `did_multiplegt_dyn(..., by_path, predict_het)` dispatcher does: it forwards `predict_het` into each per-path `did_multiplegt_main` call alongside `placebo`, so R may emit per-path placebo heterogeneity rows we do not yet mirror. Rewrite both surfaces as an explicit Python-side deferral, NOT a verified R-parity. TODO row added to track validating R's actual output and either implementing parity or documenting the deviation explicitly.
P2 (REGISTRY rtol claim) - the per-path heterogeneity R-parity paragraph claimed `rtol ~1e-6 on point estimates AND SE`, but parity tests use `BETA_RTOL=1e-6` and `SE_RTOL=1e-5`. Split the claim and note the WLS-denominator/cohort-recentering numerical drift motivating the looser SE bound.
P2 (replicate-weight df_survey refresh test gap) - the existing test `test_per_path_heterogeneity_replicate_weights_propagates_n_valid` would have passed if the new dedicated refresh loop failed to recompute `t_stat` / `p_value` / `conf_int` at the final `df_survey`. Strengthen to call `safe_inference(beta, se, df=df_survey)` on the first finite entry and assert the stored inference fields match.
P2 (paths_of_interest survey gap) - the documented composability of `paths_of_interest + heterogeneity + survey_design` was not regression-locked (all survey-specific tests used `by_path=k`). Add two new tests: analytical Binder TSL coverage with selector-ordering preservation, and the multiplier-bootstrap gate under `paths_of_interest`.

No estimator behavior, weighting, variance/SE, identification, or default statistical surface changed in source - documentation accuracy plus expanded regression coverage only.
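
To make the split-tolerance point concrete, here is a minimal sketch of why a single `rtol ~1e-6` claim did not describe the parity tests. Only the two constants come from the PR text. The helper name, the numbers, and the use of `math.isclose` as a stand-in for whatever comparison the real parity tests use are assumptions.

```python
import math

# Tolerances quoted in the PR text; everything else here is illustrative.
BETA_RTOL = 1e-6
SE_RTOL = 1e-5


def assert_parity(py_beta, py_se, r_beta, r_se):
    """Hypothetical helper: compare a Python estimate against an R golden value."""
    assert math.isclose(py_beta, r_beta, rel_tol=BETA_RTOL), "beta outside BETA_RTOL"
    assert math.isclose(py_se, r_se, rel_tol=SE_RTOL), "se outside SE_RTOL"


# Made-up numbers: the SE differs by about 5e-6 in relative terms.
r_beta, r_se = 0.250000, 0.040000
py_beta, py_se = 0.250000, 0.0400002
assert_parity(py_beta, py_se, r_beta, r_se)

# The same SE would not satisfy the tighter beta bound.
se_within_beta_bound = math.isclose(py_se, r_se, rel_tol=BETA_RTOL)
```

An SE drift of this size passes at `SE_RTOL` but not at `BETA_RTOL`, which is the gap the REGISTRY wording now states explicitly.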

Test plan

CI - new tests run as part of the standard slow-test matrix.
Locally verified: `pytest tests/test_chaisemartin_dhaultfoeuille.py::TestByPathHeterogeneity::test_paths_of_interest_heterogeneity_{survey_design_analytical,survey_n_bootstrap_gate} tests/test_chaisemartin_dhaultfoeuille.py::TestByPathHeterogeneity::test_per_path_heterogeneity_replicate_weights_propagates_n_valid -m ''` all pass.

🤖 Generated with Claude Code

The restored CI reviewer surfaced findings the degraded reviewer missed across all 5 prior rounds on PR #412:

P1 (REGISTRY + code comment): the claim that "R does not ship per-path predict_het on placebos either, so parity is preserved by deferral" contradicts what R's `did_multiplegt_dyn(..., by_path, predict_het)` dispatcher actually does - it forwards `predict_het` into each per-path `did_multiplegt_main` call along with `placebo`, so R may emit per-path placebo heterogeneity rows we do not yet mirror. Rewrite both surfaces (chaisemartin_dhaultfoeuille_results.py code comment and REGISTRY.md DataFrame-integration paragraph) as an explicit Python-side deferral rather than a verified R-parity. Add a TODO row to track validating R's actual placebo predict_het output and either implementing parity or documenting the deviation explicitly.

P2 (REGISTRY rtol claim): the per-path heterogeneity R-parity paragraph claimed "rtol ~1e-6 on point estimates AND SE", but the parity tests use BETA_RTOL=1e-6 and SE_RTOL=1e-5 (one decade looser on SE). Split the claim into the two separate tolerances and note the WLS-denominator/cohort-recentering numerical drift that motivates the looser SE bound.

P2 (replicate-weight df_survey refresh): the existing test only checked finite SE; it would have passed if the new dedicated heterogeneity refresh loop failed to recompute t_stat / p_value / conf_int at the final df_survey. Strengthen the test to call `safe_inference(beta, se, df=df_survey)` on the first finite entry and assert the stored inference fields match - this anti-regression covers the dedicated post-call refresh added for path_heterogeneity_effects.

P2 (paths_of_interest survey gap): the documented composability of `paths_of_interest + heterogeneity + survey_design` was not regression-locked - all existing survey-specific tests used `by_path=k`. Add test_paths_of_interest_heterogeneity_survey_design_analytical (verify analytical Binder TSL fits, selector ordering preserved, finite SE per populated (path, l)) and test_paths_of_interest_heterogeneity_survey_n_bootstrap_gate (verify the multiplier-bootstrap gate applies under paths_of_interest too).

No estimator behavior, weighting, variance/SE, identification, or default statistical surface changed in source - documentation accuracy plus expanded regression coverage only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-13T00:19:36Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 7e9f1aa861fd043ea6b276b57c7ec4363d1e5bbe

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

The diff does not change dCDH estimator logic, weighting, variance/SE computation, identification assumptions, or defaults; it changes documentation/comments and adds regression coverage around by_path heterogeneity.
Affected methodology surface: dCDH per-path heterogeneity testing (predict_het, Web Appendix Section 1.5 / Lemma 7) and the by-path placebo-row export path.
The registry/comment updates correctly stop claiming verified R parity for per-path placebo heterogeneity and instead document the current Python behavior as a tracked deferral. docs/methodology/REGISTRY.md:L643-L645, diff_diff/chaisemartin_dhaultfoeuille_results.py:L1862-L1903, TODO.md:L63
The rewritten parity-tolerance language matches the actual parity test constants (BETA_RTOL=1e-6, SE_RTOL=1e-5). tests/test_chaisemartin_dhaultfoeuille_parity.py:L1404-L1469
The new tests target the right regressions: final df_survey refresh on path_heterogeneity_effects, paths_of_interest + heterogeneity + survey_design, and the survey + multiplier-bootstrap gate under paths_of_interest. tests/test_chaisemartin_dhaultfoeuille.py:L10577-L10703
I could not execute the tests in this environment because pytest is not installed.

Methodology

Severity: P3. Impact: The remaining per-path placebo heterogeneity gap is now accurately documented as a Python-side deferral rather than an asserted R-parity fact; this matches the live behavior, where placebo rows in to_dataframe(level="by_path") still emit NaN het_* fields. Concrete fix: None required for approval; future parity work is already tracked. docs/methodology/REGISTRY.md:L643-L645, diff_diff/chaisemartin_dhaultfoeuille_results.py:L1862-L1903, TODO.md:L63
No P0/P1 methodology findings.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: The new TODO entry properly tracks the unresolved placebo-heterogeneity parity work, so it is mitigated under the project’s deferred-work policy rather than remaining implicit. Concrete fix: None required in this PR. TODO.md:L63

Security

No findings.

Documentation/Tests

No findings. The added tests are well-targeted to the touched risks, especially the dedicated final-df refresh loop in fit(). diff_diff/chaisemartin_dhaultfoeuille.py:L3957-L4070, tests/test_chaisemartin_dhaultfoeuille.py:L10577-L10703
Verification note: test execution was not possible here because neither pytest nor python -m pytest is available in the review environment.

Two informational findings from the latest review:

1. Stale `to_dataframe(level="by_path")` docstring at `chaisemartin_dhaultfoeuille_results.py:1530-1558` still claimed placebo `het_*` columns are NaN. Updated to document the post-#422 contract: positive-horizon AND negative-horizon (placebo) rows are both populated when `placebo=True + heterogeneity=` are co-set; placebo rows under `survey_design` remain NaN with a fit-time UserWarning.

2. JSON golden-fixture type instability for empty `placebo_predict_het` / `placebo_horizons` slots. R's `jsonlite::toJSON` serializes plain `list()` as `[]` (array) but populated named lists as `{}` (object), so consumers iterating `.items()` on the slot saw different shapes across scenarios. Fixed both ends:
- R-side: extractors initialize empty slots with `structure(list(), names = character(0))`, which jsonlite serializes as `{}` even when empty. Verified across 4 scenarios (20, 21, 22, 23): all `placebo_predict_het` / `placebo_horizons` slots now serialize as objects regardless of population.
- Python-side: added `_as_dict` helper in the parity test module as a defensive backstop coercing any non-dict (None / [] / missing) to {}. Used at the two call sites that read optional placebo slots so consumers can call `.items()` uniformly.

The golden JSON was regenerated and is type-stable across all scenarios (verified via `jq` on each scenario's predict_het type). 314 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
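
The Python-side backstop described in this commit message can be sketched as follows. The real `_as_dict` implementation is not shown in this thread, so the function body, the payloads, and the inner keys are assumptions made for illustration only.

```python
import json


def as_dict(slot):
    """Hypothetical stand-in for the `_as_dict` backstop: non-dicts become {}."""
    return slot if isinstance(slot, dict) else {}


# Three shapes a consumer might meet: empty array, populated object, missing key.
payloads = [
    '{"placebo_predict_het": []}',
    '{"placebo_predict_het": {"-1": {"beta": 0.1}}}',
    '{}',
]

collected = []
for raw in payloads:
    scenario = json.loads(raw)
    slot = as_dict(scenario.get("placebo_predict_het"))
    # .items() is now safe regardless of how the slot was serialized.
    collected.append(sorted(slot.items()))
```

With the coercion in place, the empty-array and missing-key cases both iterate as empty mappings instead of raising `AttributeError` on `.items()`.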

…tract

Two stale in-code documentation surfaces from the latest review:

1. `chaisemartin_dhaultfoeuille.py:980-989` - the `heterogeneity` parameter docstring on `fit()` still said "post-treatment regressions only (no placebo regressions)". Updated to document the post-#422 contract: per-horizon OLS regressions on forward AND backward (placebo) horizons when `placebo=True`; survey_design composes with forward horizons but warns + skips backward horizons until the pre-period cell allocator is derived.

2. `chaisemartin_dhaultfoeuille_results.py:1279-1286` - the heterogeneity summary note didn't mention the survey forward-only fallback. Extended the note to cover the gating semantics so users reading `result.summary()` under `survey_design + heterogeneity` know what they're getting.

Both surfaces now match the contract already documented in the API rst, REGISTRY, and CHANGELOG. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A holistic codex review of the merged igerber#412 + cleanup igerber#422 state surfaced three documentation/test gaps that the per-PR cleanup review path could not see (it only scopes to the cleanup diff). All three are at-most P3 in severity but each is real claim-vs-coverage drift.

1. REGISTRY's top-level `Note (Phase 3 by_path ...)` `to_dataframe(level="by_path")` schema list omits the `het_*` columns (`het_beta`, `het_se`, `het_t_stat`, `het_p_value`, `het_conf_int_lower`, `het_conf_int_upper`) that `_to_dataframe` has always emitted since the Phase 5 heterogeneity wave landed. Add them to the schema list so the registry contract matches the implementation.

2. The two new parity tests (`TestDCDHDynRParityHeterogeneity`, `TestDCDHDynRParityByPathHeterogeneity`) assert only `beta` and `se` from the R golden payload, leaving `t_stat`, `p_value`, `conf_int`, and `n_obs` unpinned. A regression in the inference extraction or final-`df_survey` refresh could ship while parity still passes. Pin `t_stat` at `SE_RTOL` (invariant to critical-value distribution since `t = beta / se`) and `n_obs` exactly.

3. While extending the parity assertions, surfaced a real Python-vs-R structural deviation that was undocumented: `_compute_heterogeneity_test` passes `df=None` to `safe_inference`, so Python uses the normal Z critical value (~1.96) for `p_value` and `conf_int`. R `did_multiplegt_dyn(..., predict_het)` uses the t-distribution with df = n - k from the WLS regression. The structural gap produces ~0.1-2% rtol gaps on CIs and p-values that exceed `SE_RTOL` (verified empirically on the parity fixture: CI gap ~0.17% on h=1). Document the deviation in the heterogeneity R-parity Note. Pin only `beta`, `se`, `t_stat`, `n_obs` in the parity tests; `p_value` and `conf_int` parity intentionally skipped. Add a TODO row tracking the optional df-threading work.
No estimator behavior, weighting, variance/SE computation, or default-statistical surface changed - documentation accuracy plus expanded regression coverage only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
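
The Z-versus-t deviation in item 3 can be illustrated with a small sketch. It uses the standard first-order Cornish-Fisher expansion as an approximation to the t quantile so that it runs on the standard library alone. The estimate, standard error, and degrees of freedom are made up, and nothing here reproduces the library's `safe_inference`.

```python
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)  # normal critical value, about 1.96


def t_crit_approx(df):
    """First-order Cornish-Fisher approximation to the 97.5% t quantile."""
    return z + (z**3 + z) / (4 * df)


beta, se = 0.30, 0.10  # made-up estimate and standard error
t_stat = beta / se  # does not involve any critical value

for df in (50, 200, 1000):
    half_width_z = z * se
    half_width_t = t_crit_approx(df) * se
    rel_gap = half_width_t / half_width_z - 1
    print(f"df={df}: relative CI half-width gap ~ {rel_gap:.4%}")
```

Because `t_stat` depends only on `beta` and `se`, it can be pinned at `SE_RTOL` under either convention, while the interval width scales with the critical value and so differs between the two.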

Closes TODO igerber#422 + pilot-412 in a single PR (same surface, same R fixture, same parity story).

Phase 0 probe verified R behavior: did_multiplegt_dyn(by_path, predict_het, placebo) emits per-path heterogeneity OLS results on backward (placebo) horizons via R's per-by_level dispatcher (DIDmultiplegtDYN:::did_multiplegt_main placebo block at the `effect = matrix(-i, ...)` rbind site). New scenario 22 in benchmarks/R/generate_dcdh_dynr_test_values.R captures this with predict_het=list("het_x", c(-1)): the c(-1) sentinel triggers "compute heterogeneity for ALL forward (1..effects) AND ALL placebo (1..placebo) positions" per the R source path read at script time.

Phase 1A implementation (non-survey): _compute_heterogeneity_test gains a placebo: int = 0 parameter and iterates forward (1..L_max) and backward (-1..-placebo) horizons in a single loop. Explicit `if out_idx < 0: continue` eligibility guard prevents numpy negative-index silent wrap on N_mat[g, out_idx] when F_g - 1 + l_h < 0. _compute_path_heterogeneity_test forwards the param; fit() passes placebo=L_max if self.placebo else 0 to both global and per-path call sites. to_dataframe(level="by_path") placebo rows now read het_* values from path_heterogeneity_effects negative-int keys (mirroring the existing path_placebo_event_study negative-key convention) instead of the pre-PR hardcoded NaN-fill.

Survey gate: when survey_design is active AND placebo > 0 + heterogeneity is requested, _compute_heterogeneity_test raises NotImplementedError eagerly with a documented message. The Binder TSL cell-period allocator's REGISTRY justification is tied to post-period attribution; backward-horizon attribution puts ψ_g mass on a pre-period cell, which is a separate library-extension claim that needs its own derivation. Forward-horizon predict_het + survey continues to work unchanged. Pre-period allocator derivation tracked as a new follow-up TODO row.
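
The eligibility guard from Phase 1A addresses a general indexing hazard, sketched below with a plain Python list, which wraps negative indices the same way a NumPy array does. The row contents, function names, and the extra upper-bound check are assumptions; only the `out_idx < 0` condition and the `F_g - 1 + l_h` index expression come from the commit message.

```python
# Hypothetical per-group row of counts indexed by period position.
n_row = [10, 11, 12, 13]


def lookup_unguarded(first_switch, horizon):
    out_idx = first_switch - 1 + horizon
    return n_row[out_idx]  # a negative out_idx silently reads from the end


def lookup_guarded(first_switch, horizon):
    out_idx = first_switch - 1 + horizon
    if out_idx < 0 or out_idx >= len(n_row):
        return None  # ineligible horizon: skip instead of wrapping
    return n_row[out_idx]


# A group switching at position 1 has no period two steps before baseline.
wrapped = lookup_unguarded(first_switch=1, horizon=-2)
skipped = lookup_guarded(first_switch=1, horizon=-2)
```

Without the guard, the lookup returns a value from the wrong end of the row rather than failing, so an ineligible backward horizon would contribute data that looks valid.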
Phase 2 (df threading): _compute_heterogeneity_test now passes df = n_obs - n_params to safe_inference on the non-survey OLS path, matching R did_multiplegt_dyn(predict_het=...)'s t-distribution inference (qt(0.975, df.residual(model)) site). Pre-PR Python used df=None (Z critical), producing 0.1-2% rtol gaps on p_value/conf_int vs R. Existing forward-horizon parity tests now pin t/p/CI at INFERENCE_RTOL=1e-4 (was unpinned). Rank-deficient designs use design.shape[1] as df denominator (pre-drop column count); fully rank-deficient is NaN-short-circuited by the existing guard. Near-rank-deficient edge case tracked as a new Low TODO follow-up.

R parity: scenario 22 (multi_path_reversible_predict_het_with_placebo, placebo=2, effects=3, by_path=3) pinned at BETA_RTOL=1e-6/SE_RTOL=1e-5 for beta/se/t_stat/n_obs and INFERENCE_RTOL=1e-4 for p_value/conf_int across 3 paths × (3 forward + 2 placebo) = 15 horizons.

Cross-surface tests (TestByPathPredictHetPlacebo): placebo het column population, survey-gate NotImplementedError, forward+survey anti-regression, out_idx<0 eligibility guard, single-path telescope (path_heterogeneity_effects[(only_path,)] == heterogeneity_effects bit-exactly), summary rendering. The two existing local-invariant tests (test_*_inference_matches_safe_inference) refactored to verify SE-derivation wiring (t_stat=beta/se, conf_int symmetric around beta, p_value in [0,1]) without back-deriving n_params.

REGISTRY: heterogeneity Z-vs-t deviation note replaced with positive "R parity (post-2026-05-15 df threading)" framing including the rank-deficient caveat. New "Per-path placebo heterogeneity" Note documents the R parity, syntax requirements (c(-1) sentinel), survey gate, and test anchors. CHANGELOG entry under [Unreleased]. llms-full.txt by_path entry extended with placebo het composition + survey-gate mention. API rst extended with the same.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
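
The refactored local-invariant checks mentioned above (t_stat=beta/se, symmetric interval, p_value in [0,1]) can be sketched without knowing the degrees of freedom. The function name, field layout, and numbers below are assumptions for illustration and are not taken from the test suite.

```python
import math


def check_inference_wiring(beta, se, t_stat, p_value, conf_int):
    """Hypothetical local-invariant check; raises AssertionError on a mismatch."""
    lower, upper = conf_int
    assert math.isclose(t_stat, beta / se)
    # Symmetry: beta sits at the midpoint of the interval.
    assert math.isclose((lower + upper) / 2, beta)
    assert 0.0 <= p_value <= 1.0
    return True


# Made-up stored fields for a single (path, horizon) entry.
beta, se, crit = 0.30, 0.10, 2.0
entry = {
    "beta": beta,
    "se": se,
    "t_stat": beta / se,
    "p_value": 0.0027,  # illustrative value, only range-checked here
    "conf_int": (beta - crit * se, beta + crit * se),
}
ok = check_inference_wiring(**entry)
```

These checks hold for any critical value, which is why they do not need `n_params` to be back-derived; agreement with the t-distribution itself is left to the R parity tests.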

igerber and others added 2 commits May 12, 2026 20:14

Fix PR reference in deferred-placebo-het TODO row (#421 -> #422)

7e9f1aa

igerber added the ready-for-ci Triggers CI test workflows label May 13, 2026

igerber merged commit 9a2435a into main May 13, 2026
31 of 32 checks passed

igerber deleted the fix-audit-412 branch May 13, 2026 10:25

This was referenced May 14, 2026

Address #412 holistic re-audit residuals (R2) #430

Merged

dCDH heterogeneity: per-path + global placebo predict_het R-parity + df threading #449

Merged

igerber merged 2 commits into main from fix-audit-412
